<!DOCTYPE html>
<html lang="en">

<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
  <meta name="google-site-verification" content="VGgQeH6NiuAKspyCFT7dqUNmNhg6RJoMYQXErdy0jgE" />
  <meta name="baidu-site-verification" content="code-yNmdsKJ9GP" />
  
  
  <meta name="keywords" content="Elasticsearch,elasticsearch,setup,environment setup">
  
  
  <meta name="description" content="Elasticsearch environment setup">
  
  <title>
    7. Elasticsearch Relevance and Relevance Scoring |
    
    思远程序
  </title>
  <!-- Icon -->
  
    <link rel="shortcut icon" href="/favicon.ico">
    
  
<link rel="stylesheet" href="/css/style.css">

  
  
<link rel="stylesheet" href="/fancybox/jquery.fancybox.min.css">

  
  
<script src="/js/pace.min.js"></script>

<meta name="generator" content="Hexo 6.2.0"></head>

<body>
  <main class="content">
    <section class="outer">
  <article id="post-7. Elasticsearch 相关性和相关性算分" class="article article-type-post" itemscope
  itemprop="blogPost" data-scroll-reveal>

  <div class="article-inner">
    
    <header class="article-header">
      

<h1 class="article-title" itemprop="name">
  7. Elasticsearch Relevance and Relevance Scoring
</h1>



    </header>
    

    
    <div class="article-meta">
      <a href="/2022/07/23/7.%20Elasticsearch%20%E7%9B%B8%E5%85%B3%E6%80%A7%E5%92%8C%E7%9B%B8%E5%85%B3%E6%80%A7%E7%AE%97%E5%88%86/" class="article-date">
  <time datetime="2022-07-23T08:00:00.000Z" itemprop="datePublished">2022-07-23</time>
</a>
      
<div class="article-category">
  <a class="article-category-link" href="/categories/%E5%88%86%E5%B8%83%E5%BC%8F%E6%A1%86%E6%9E%B6/">Distributed Frameworks</a>
</div>

    </div>
    

    
    
<div class="tocbot"></div>

    

    <div class="article-entry" itemprop="articleBody">
      
      
      
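      <h4 id="相关性和相关性算分"><a href="#相关性和相关性算分" class="headerlink" title="相关性和相关性算分"></a>Relevance and Relevance Scoring</h4><p>Search is a conversation between the user and the search engine, and what the user cares about is how relevant the results are:</p>
<ul>
<li>whether all the relevant content can be found;</li>
<li>how much irrelevant content is returned;</li>
<li>whether documents are scored sensibly, so that the ranking balances relevance against business requirements.</li>
</ul>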
<span id="more"></span>

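<p>How relevance is measured:</p>
<ul>
<li>Precision: return as few irrelevant documents as possible;</li>
<li>Recall: return as many of the relevant documents as possible;</li>
<li>Ranking: whether the results can be ordered by relevance.</li>
</ul>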
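<p>Precision and recall can be made concrete with a few lines of Python. This is a minimal sketch: the function name <code>precision_recall</code> and the document IDs are hypothetical, and the set of relevant documents is assumed to come from human judgment rather than from Elasticsearch.</p>

```python
def precision_recall(returned, relevant):
    """Compute precision and recall for one query.

    returned: IDs the search engine returned
    relevant: IDs judged relevant by a person (the ground truth)
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned.intersection(relevant)  # relevant documents that were returned
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgment data: 4 documents returned, 3 of them relevant,
# out of 6 relevant documents in the whole index.
p, r = precision_recall(returned=[1, 2, 3, 7], relevant=[1, 2, 3, 4, 5, 6])
print(p, r)  # 0.75 0.5
```

<p>Returning more documents tends to raise recall and lower precision, which is why ranking matters: it decides which of the returned documents the user sees first.</p>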
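<h4 id="相关性（Relevance）"><a href="#相关性（Relevance）" class="headerlink" title="相关性（Relevance）"></a>Relevance</h4><p>A relevance score describes how well a document matches a query. ES computes a score, <code>_score</code>, for every document that matches the query. Scoring exists to drive ranking: the documents that best meet the user's need should come first.<br>Before ES 5 the default relevance algorithm was TF-IDF; since ES 5 it is BM25.</p>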
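<p>Consider the inverted index below. For the query "JAVA multithreading design patterns", documents 2 and 3 are the only ones that contain all three terms, so they clearly score higher:</p>
<table>
<thead>
<tr>
<th><strong>Term</strong></th>
<th><strong>Document IDs</strong></th>
</tr>
</thead>
<tbody><tr>
<td>JAVA</td>
<td>1,2,3</td>
</tr>
<tr>
<td>design patterns</td>
<td>1,2,3,4,5,6</td>
</tr>
<tr>
<td>multithreading</td>
<td>2,3,7,9</td>
</tr>
</tbody></table>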
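<h5 id="TF-IDF"><a href="#TF-IDF" class="headerlink" title="TF-IDF"></a>TF-IDF</h5><p>TF-IDF (term frequency–inverse document frequency) is a weighting technique commonly used in information retrieval and data mining.</p>
<ul>
<li>TF-IDF is widely regarded as one of the most important inventions in information retrieval; beyond retrieval it is also used extensively in document classification and related fields.</li>
<li>The concept of IDF was first proposed by Karen Spärck Jones of the University of Cambridge. <ul>
<li>1972: "A statistical interpretation of term specificity and its application in retrieval". The paper did not explain theoretically why IDF should be log(total number of documents &#x2F; number of documents containing the term) rather than some other function, and did not pursue the question further;</li>
<li>In the 1970s and 1980s Salton and Robertson carried the work further and gave a justification based on Shannon's information theory: <code>http://www.staff.city.ac.uk/~sb317/papers/foundations_bm25_review.pdf</code></li>
</ul>
</li>
<li>Modern search engines apply many fine-grained refinements to TF-IDF.</li>
</ul>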
<p>The TF-IDF scoring formula in Lucene:<br><img src="https://cdn.nlark.com/yuque/0/2022/png/1445568/1658563153586-0800815f-9b91-4720-96da-71e17a9a8c52.png#clientId=ucae64a51-2449-4&crop=0&crop=0&crop=1&crop=1&from=paste&height=486&id=u4b1da777&margin=%5Bobject%20Object%5D&name=image.png&originHeight=486&originWidth=1866&originalType=binary&ratio=1&rotation=0&showTitle=false&size=276899&status=done&style=none&taskId=ue1f1363f-9668-4bd7-b3f6-beb1166576b&title=&width=1866" alt="image.png"></p>
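<ol>
<li>TF: term frequency</li>
</ol>
<p>The more often the search term appears in a document, the more relevant that document is.</p>
<ol start="2">
<li>IDF: inverse document frequency</li>
</ol>
<p>How common the search term is across the whole index: the more documents contain it, the less it contributes to relevance.</p>
<ol start="3">
<li>Field-length norm</li>
</ol>
<p>How long is the field? The shorter the field, the higher its weight: a term that appears in a short <code>title</code> field counts for more than the same term in a long <code>content</code> field.<br>These three factors (term frequency, inverse document frequency, and field-length norm) are calculated and stored at index time, and are then combined to compute the weight of a single term in a particular document.</p>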
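<p>The three factors can be sketched in Python as follows. This is only an illustration, assuming the simplified per-term formulas documented for Lucene's classic similarity (square-root term frequency, logarithmic IDF, inverse-square-root field length); the full Lucene scoring function contains further factors, and the function names and sample numbers here are hypothetical.</p>

```python
import math


def tf(term_freq):
    # Term frequency: square root of the number of occurrences in the field.
    return math.sqrt(term_freq)


def idf(num_docs, doc_freq):
    # Inverse document frequency: rarer terms get a larger value.
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_terms):
    # Field-length norm: shorter fields get a larger value.
    return 1.0 / math.sqrt(num_terms)


def term_weight(term_freq, num_docs, doc_freq, num_terms):
    # Weight of one term in one field of one document.
    return tf(term_freq) * idf(num_docs, doc_freq) * field_norm(num_terms)


# Hypothetical numbers: the term occurs once and appears in 2 of 4 documents.
short_field = term_weight(term_freq=1, num_docs=4, doc_freq=2, num_terms=3)
long_field = term_weight(term_freq=1, num_docs=4, doc_freq=2, num_terms=7)
print(short_field > long_field)  # True: the shorter field gets the higher weight
```

<p>With everything else equal, only the field length differs between the two calls, so the comparison isolates the effect of the field-length norm.</p>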
<h5 id="BM25"><a href="#BM25" class="headerlink" title="BM25"></a>BM25</h5><p>BM25 is a refinement of TF-IDF. In TF-IDF, the larger TF(t) becomes, the larger the value of the whole formula, without limit.<br>BM25 addresses exactly this: as TF(t) keeps growing, its result levels off toward a fixed value.<br>Since ES 5 the default algorithm has been BM25.<br><img src="https://cdn.nlark.com/yuque/0/2022/png/1445568/1658564092589-ba02e27a-094f-4b44-b27f-97aa942ebb1a.png#clientId=ucae64a51-2449-4&crop=0&crop=0&crop=1&crop=1&from=paste&height=838&id=u61ef889f&margin=%5Bobject%20Object%5D&name=image.png&originHeight=838&originWidth=1356&originalType=binary&ratio=1&rotation=0&showTitle=false&size=501497&status=done&style=none&taskId=u5751758d-42ba-4507-ac27-de4819400f0&title=&width=1356" alt="image.png"><br>The BM25 formula:<br><img src="https://cdn.nlark.com/yuque/0/2022/png/1445568/1658564114062-3f1e7589-cf1e-4748-b554-1026116b3097.png#clientId=ucae64a51-2449-4&crop=0&crop=0&crop=1&crop=1&from=paste&height=210&id=ua4163824&margin=%5Bobject%20Object%5D&name=image.png&originHeight=210&originWidth=1826&originalType=binary&ratio=1&rotation=0&showTitle=false&size=111957&status=done&style=none&taskId=uf7a8d376-0c53-4ac0-902d-d7e37f3e0fd&title=&width=1826" alt="image.png"></p>
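<p>The saturation effect is easy to see numerically. The sketch below uses the textbook form of BM25 with the Elasticsearch defaults <code>k1 = 1.2</code> and <code>b = 0.75</code>; the function names are hypothetical, and some Lucene versions drop the constant <code>(k1 + 1)</code> factor, which changes absolute scores but not the ranking.</p>

```python
import math


def bm25_tf(freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    # Term-frequency component of BM25. It grows with freq but never
    # exceeds k1 + 1, unlike the unbounded sqrt(freq) of classic TF-IDF.
    length_factor = 1 - b + b * doc_len / avg_doc_len
    return freq * (k1 + 1) / (freq + k1 * length_factor)


def bm25_idf(num_docs, doc_freq):
    # Inverse document frequency in the form used by Lucene's BM25.
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


# Compare how the two term-frequency components grow.
for freq in (1, 2, 5, 10, 100, 1000):
    classic = math.sqrt(freq)
    bm25 = bm25_tf(freq, doc_len=10, avg_doc_len=10)
    print(freq, round(classic, 3), round(bm25, 3))
```

<p>In the printed table the classic column keeps growing, while the BM25 column approaches <code>k1 + 1 = 2.2</code>. A smaller <code>k1</code> makes the curve flatten sooner, and <code>b</code> controls how strongly document length is taken into account.</p>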
<h5 id="通过Explain-API查看TF-IDF"><a href="#通过Explain-API查看TF-IDF" class="headerlink" title="通过Explain API查看TF-IDF"></a>Inspecting TF-IDF with the Explain API</h5><figure class="highlight json"><table><tr><td class="code"><pre><span class="line">DELETE /test_score</span><br><span class="line"></span><br><span class="line">PUT /test_score/_bulk</span><br><span class="line"><span class="punctuation">&#123;</span><span class="attr">&quot;index&quot;</span><span class="punctuation">:</span><span class="punctuation">&#123;</span><span class="attr">&quot;_id&quot;</span><span class="punctuation">:</span><span class="number">1</span><span class="punctuation">&#125;</span><span class="punctuation">&#125;</span></span><br><span class="line"><span class="punctuation">&#123;</span><span class="attr">&quot;content&quot;</span><span class="punctuation">:</span><span class="string">&quot;we use Elasticsearch to power the search&quot;</span><span class="punctuation">&#125;</span></span><br><span class="line"><span class="punctuation">&#123;</span><span class="attr">&quot;index&quot;</span><span class="punctuation">:</span><span class="punctuation">&#123;</span><span class="attr">&quot;_id&quot;</span><span class="punctuation">:</span><span 
class="number">2</span><span class="punctuation">&#125;</span><span class="punctuation">&#125;</span></span><br><span class="line"><span class="punctuation">&#123;</span><span class="attr">&quot;content&quot;</span><span class="punctuation">:</span><span class="string">&quot;we like elasticsearch&quot;</span><span class="punctuation">&#125;</span></span><br><span class="line"><span class="punctuation">&#123;</span><span class="attr">&quot;index&quot;</span><span class="punctuation">:</span><span class="punctuation">&#123;</span><span class="attr">&quot;_id&quot;</span><span class="punctuation">:</span><span class="number">3</span><span class="punctuation">&#125;</span><span class="punctuation">&#125;</span></span><br><span class="line"><span class="punctuation">&#123;</span><span class="attr">&quot;content&quot;</span><span class="punctuation">:</span><span class="string">&quot;The scoring of documents is calculated by the scoring formula&quot;</span><span class="punctuation">&#125;</span></span><br><span class="line"><span class="punctuation">&#123;</span><span class="attr">&quot;index&quot;</span><span class="punctuation">:</span><span class="punctuation">&#123;</span><span class="attr">&quot;_id&quot;</span><span class="punctuation">:</span><span class="number">4</span><span class="punctuation">&#125;</span><span class="punctuation">&#125;</span></span><br><span class="line"><span class="punctuation">&#123;</span><span class="attr">&quot;content&quot;</span><span class="punctuation">:</span><span class="string">&quot;you know,for search&quot;</span><span class="punctuation">&#125;</span></span><br><span class="line"></span><br><span class="line"># Returns <span class="number">2</span> hits; document <span class="number">2</span> ranks first because it is shorter than document <span class="number">1</span>, which raises its score</span><br><span class="line">GET /test_score/_search</span><br><span class="line"><span class="punctuation">&#123;</span></span><br><span class="line">  <span 
class="attr">&quot;explain&quot;</span><span class="punctuation">:</span> <span class="keyword">true</span><span class="punctuation">,</span></span><br><span class="line">  <span class="attr">&quot;query&quot;</span><span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">    <span class="attr">&quot;match&quot;</span><span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;content&quot;</span><span class="punctuation">:</span> <span class="string">&quot;elasticsearch&quot;</span></span><br><span class="line">    <span class="punctuation">&#125;</span></span><br><span class="line">  <span class="punctuation">&#125;</span></span><br><span class="line"><span class="punctuation">&#125;</span></span><br><span class="line"></span><br></pre></td></tr></table></figure>
<p><img src="https://cdn.nlark.com/yuque/0/2022/png/1445568/1658566785852-b6e69b2c-ea2b-46a5-b3df-f9aa8bfdb453.png#clientId=ucae64a51-2449-4&crop=0&crop=0&crop=1&crop=1&from=paste&height=912&id=u1b5f2cdd&margin=%5Bobject%20Object%5D&name=image.png&originHeight=912&originWidth=1548&originalType=binary&ratio=1&rotation=0&showTitle=false&size=269268&status=done&style=none&taskId=udc264db0-817d-4717-a05c-1a2f6aeefeb&title=&width=1548" alt="image.png"></p>
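<p>The ranking of the two hits can be reproduced offline with a small BM25 calculation over the four sample documents. This is a rough sketch, not what Elasticsearch executes: <code>analyze</code> is a hypothetical stand-in for the standard analyzer, statistics are computed over all documents at once (Elasticsearch computes them per shard), and the real index stores field lengths in a lossy encoding, so absolute scores will differ from the Explain output.</p>

```python
import math
import re

DOCS = {
    1: "we use Elasticsearch to power the search",
    2: "we like elasticsearch",
    3: "The scoring of documents is calculated by the scoring formula",
    4: "you know,for search",
}


def analyze(text):
    # Rough stand-in for the standard analyzer: lowercase, split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_search(term, docs, k1=1.2, b=0.75):
    tokens = {doc_id: analyze(text) for doc_id, text in docs.items()}
    avg_len = sum(len(t) for t in tokens.values()) / len(tokens)
    doc_freq = sum(1 for t in tokens.values() if term in t)
    idf = math.log(1 + (len(tokens) - doc_freq + 0.5) / (doc_freq + 0.5))
    scores = {}
    for doc_id, toks in tokens.items():
        freq = toks.count(term)
        if freq == 0:
            continue  # documents that do not contain the term are not returned
        length_factor = 1 - b + b * len(toks) / avg_len
        scores[doc_id] = idf * freq * (k1 + 1) / (freq + k1 * length_factor)
    # Highest score first, as in a search response.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = bm25_search("elasticsearch", DOCS)
print([doc_id for doc_id, _ in ranking])  # [2, 1]
```

<p>Both documents contain the term exactly once and share the same IDF, so the only thing separating them is length: document 2 has 3 tokens against 7 for document 1.</p>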
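<h5 id="Boosting"><a href="#Boosting" class="headerlink" title="Boosting"></a>Boosting</h5><p>Boosting is one way to control relevance. The <code>boost</code> parameter works as follows:</p>
<ol>
<li>When boost &gt; 1, the relative weight of the score is raised;</li>
<li>When 0 &lt; boost &lt; 1, the relative weight of the score is lowered;</li>
<li>When boost &lt; 0, the clause contributes a negative score. Note that Elasticsearch 7.0 and later reject negative boost values.</li>
</ol>
<p>The <code>boosting</code> query returns documents that match the <code>positive</code> query and lowers the relevance score of those that also match the <code>negative</code> query. Documents can thus be demoted without being excluded: they still appear in the results, only with a lower score than a normal match.</p>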
<figure class="highlight json"><table><tr><td class="code"><pre><span class="line">GET /test_score/_search</span><br><span class="line"><span class="punctuation">&#123;</span></span><br><span class="line">  <span class="attr">&quot;query&quot;</span><span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">    <span class="attr">&quot;boosting&quot;</span><span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">      <span class="attr">&quot;positive&quot;</span><span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">        <span class="attr">&quot;term&quot;</span><span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">          <span class="attr">&quot;content&quot;</span><span class="punctuation">:</span> <span class="string">&quot;elasticsearch&quot;</span></span><br><span class="line">        <span class="punctuation">&#125;</span></span><br><span class="line">      <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;negative&quot;</span><span class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">        <span class="attr">&quot;term&quot;</span><span 
class="punctuation">:</span> <span class="punctuation">&#123;</span></span><br><span class="line">          <span class="attr">&quot;content&quot;</span><span class="punctuation">:</span> <span class="string">&quot;like&quot;</span></span><br><span class="line">        <span class="punctuation">&#125;</span></span><br><span class="line">      <span class="punctuation">&#125;</span><span class="punctuation">,</span></span><br><span class="line">      <span class="attr">&quot;negative_boost&quot;</span><span class="punctuation">:</span> <span class="number">0.2</span></span><br><span class="line">    <span class="punctuation">&#125;</span></span><br><span class="line">  <span class="punctuation">&#125;</span></span><br><span class="line"><span class="punctuation">&#125;</span></span><br></pre></td></tr></table></figure>
<p><img src="https://cdn.nlark.com/yuque/0/2022/png/1445568/1658567171084-e9532676-9826-4edb-b9f4-d1954b2a007d.png#clientId=ucae64a51-2449-4&crop=0&crop=0&crop=1&crop=1&from=paste&height=722&id=u588df449&margin=%5Bobject%20Object%5D&name=image.png&originHeight=722&originWidth=1282&originalType=binary&ratio=1&rotation=0&showTitle=false&size=159081&status=done&style=none&taskId=u3c3b9e85-4869-4a80-a3df-5fbe59f2cf8&title=&width=1282" alt="image.png"><br>Use case: results that contain certain content should not disappear, only rank lower.</p>
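<p>The effect of <code>negative_boost</code> on the final score can be sketched as below. The function <code>boosting_score</code> and the positive scores are hypothetical; the sketch assumes only the documented behaviour that a document matching the <code>negative</code> query has its positive score multiplied by <code>negative_boost</code>, a value between 0 and 1.</p>

```python
def boosting_score(positive_score, matches_negative, negative_boost):
    # A document must match the positive query to be returned at all;
    # matching the negative query only scales its score down.
    if not (1 >= negative_boost >= 0):
        raise ValueError("negative_boost must be between 0 and 1")
    if matches_negative:
        return positive_score * negative_boost
    return positive_score


# Hypothetical positive scores: document 2 ("we like elasticsearch") starts ahead
# of document 1, but it also matches the negative term "like".
hits = [
    {"id": 1, "positive_score": 0.65, "matches_negative": False},
    {"id": 2, "positive_score": 0.87, "matches_negative": True},
]
ranked = sorted(
    hits,
    key=lambda h: boosting_score(h["positive_score"], h["matches_negative"], 0.2),
    reverse=True,
)
print([h["id"] for h in ranked])  # [1, 2]
```

<p>Document 2 is still returned, but with its score multiplied by 0.2 it drops below document 1, which is the demote-without-excluding behaviour described above.</p>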

      
    </div>
    <footer class="article-footer">
      <a data-url="https://siyit.gitee.io/2022/07/23/7.%20Elasticsearch%20%E7%9B%B8%E5%85%B3%E6%80%A7%E5%92%8C%E7%9B%B8%E5%85%B3%E6%80%A7%E7%AE%97%E5%88%86/" data-id="cl6bukbtb000ysu6fbuih0lzh" class="article-share-link">
        Share
      </a>
      
<ul class="article-tag-list" itemprop="keywords"><li class="article-tag-list-item"><a class="article-tag-list-link" href="/tags/Elasticsearch/" rel="tag">Elasticsearch</a></li></ul>

    </footer>

  </div>

  
  
<nav class="article-nav">
  
  <a href="/2022/07/23/8.%20Elasticsearch%20%E5%B8%83%E5%B0%94%E6%9F%A5%E8%AF%A2%20bool%20Query/" class="article-nav-link">
    <strong class="article-nav-caption">Newer</strong>
    <div class="article-nav-title">
      
      8. Elasticsearch Boolean Query (bool Query)
      
    </div>
  </a>
  
  
  <a href="/2022/07/23/6.%20Elasticsearch%20%E7%BB%93%E6%9E%84%E5%8C%96%E6%90%9C%E7%B4%A2/" class="article-nav-link">
    <strong class="article-nav-caption">Older</strong>
    <div class="article-nav-title">6. Elasticsearch Structured Search</div>
  </a>
  
</nav>

  

  
  
<div class="vcomments" id="vcomments"></div>

<script src="https://unpkg.com/valine/dist/Valine.min.js"></script>

<script>
  new Valine({
    el: '#vcomments',
    appId: 'A7Ny5JW4l2XoShLWoQfpND2b-gzGzoHsz',
    appKey: 'lTbAjSoXEDQETkbAcE4zpYpu',
    notify: 'true',
    verify: 'true',
    avatar: 'identicon',
    pageSize: '10',
    placeholder: 'Enter a comment...'
  })
</script>

  
  

</article>
</section>
    <footer class="footer">
  <div class="outer">
    <div class="float-right">
      <ul class="list-inline">
  
  <li><i class="fe fe-smile-alt"></i> <span id="busuanzi_value_site_uv"></span></li>
  
  <li><i class="fe fe-bookmark"></i> <span id="busuanzi_value_page_pv"></span></li>
  
</ul>
    </div>
    <ul class="list-inline">
      <li>思远程序 &copy; 2022</li>
      
        <li></li>
      
      <li>Powered by <a href="http://hexo.io/" target="_blank">Hexo</a></li>
      <li>theme  <a target="_blank" rel="noopener" href="https://github.com/zhwangart/hexo-theme-ocean">Ocean</a></li>
    </ul>
  </div>
</footer>
  </main>
  <aside class="sidebar">
    <button class="navbar-toggle"></button>
<nav class="navbar">
  
  <div class="logo">
    <a href="/"><img src="/images/hexo.svg" alt="思远程序"></a>
  </div>
  
  <ul class="nav nav-main">
    
    <li class="nav-item">
      <a class="nav-item-link" href="/">Home</a>
    </li>
    
    <li class="nav-item">
      <a class="nav-item-link" href="/archives">Archives</a>
    </li>
    
    <li class="nav-item">
      <a class="nav-item-link" href="/recommend">Recommended</a>
    </li>
    
    <li class="nav-item">
      <a class="nav-item-link" href="/gallery">Gallery</a>
    </li>
    
    <li class="nav-item">
      <a class="nav-item-link" href="/favorites">Favorites</a>
    </li>
    
    <li class="nav-item">
      <a class="nav-item-link" href="/about">About</a>
    </li>
    
    <li class="nav-item">
      <a class="nav-item-link nav-item-search" title="Search">
        <i class="fe fe-search"></i>
        Search
      </a>
    </li>
  </ul>
</nav>
<nav class="navbar navbar-bottom">
  <ul class="nav">
    <li class="nav-item">
      <div class="totop" id="totop">
  <i class="fe fe-rocket"></i>
</div>
    </li>
    <li class="nav-item">
      
      <a class="nav-item-link" target="_blank" href="/atom.xml" title="RSS Feed">
        <i class="fe fe-feed"></i>
      </a>
      
    </li>
  </ul>
</nav>
<div class="search-form-wrap">
  <div class="local-search local-search-plugin">
  <input type="search" id="local-search-input" class="local-search-input" placeholder="Search...">
  <div id="local-search-result" class="local-search-result"></div>
</div>
</div>
  </aside>
  
<script src="/js/jquery-2.0.3.min.js"></script>


<script src="/js/jquery.justifiedGallery.min.js"></script>


<script src="/js/lazyload.min.js"></script>


<script src="/js/busuanzi-2.3.pure.min.js"></script>



<script src="/fancybox/jquery.fancybox.min.js"></script>





<script src="/js/tocbot.min.js"></script>


<script>
  // Tocbot_v4.7.0  http://tscanlin.github.io/tocbot/
  tocbot.init({
    tocSelector: '.tocbot',
    contentSelector: '.article-entry',
    headingSelector: 'h1, h2, h3, h4, h5, h6',
    hasInnerContainers: true,
    scrollSmooth: true,
    positionFixedSelector: '.tocbot',
    positionFixedClass: 'is-position-fixed',
    fixedSidebarOffset: 'auto',
  });
</script>



<script src="/js/ocean.js"></script>

</body>

</html>